1 Introduction

The main goal of this activity is to provide new kinds of media that extend the functionality of
available standard technology. A key feature of these media is interactivity, in the sense that the user shall be
able to choose his own viewpoint within a visual scene. Another feature covered by this activity is
stereo vision, which gives the user the impression of a 3D view of a visual scene. 3D Video can be defined as
geometrically calibrated and temporally synchronized (groups of) video data. Another definition might be image-based
rendering using video input data, or video-based rendering. This includes corresponding 3D Audio as well, which
will also be considered in this activity.

So far 3DAV includes 5 main categories of scene representations: 

1. Omni-directional (panoramic) video

This is an extension of the planar 2D image plane to a spherical or cylindrical image plane. Other kinds of 
planes (hyperbolic) are also possible. Video is captured at a certain viewpoint (which may move over time) into 
every direction. Any 2D view in any direction can be rendered from this representation. It can be applied to 
broadcast and storage (e.g. DVD) applications. 

2. Interactive stereo video 

Two views, one for each eye, are provided to produce a 3D impression for the viewer. Head motion parallax can
be supported to enable interactivity (within a certain operating range). It can be applied to broadcast, storage (e.g.
DVD), and communication applications. This can be considered a special case of interactive multiple view
video.

3. Interactive multiple view video

In this case a scene is captured by N cameras. Different camera settings are possible, e.g. parallel view,
convergent view, or divergent view, but in general any setting of cameras (e.g. combinations of the above) is possible,
i.e. multiple view. Additional information (besides the N video signals) about camera calibration and scene geometry
(e.g. disparity data) enables interactive navigation through the scene. A simple case only allows the user to choose one of
the predefined camera positions. In general, one 2D view (for conventional displays) or two views (for stereoscopic
displays) can be rendered from the data. This can be applied to broadcast and storage (e.g. DVD) applications, and in
simple cases also to communication applications.

4. 3D video objects 

A scene is captured as in multiple view video (see 3), and one or more 3D video objects are created. A 3D
video object comprises shape and appearance. The shape can be described by, e.g., polygon meshes, implicit
surfaces, depth images, or multiple layered depth images. The appearance data is mapped onto the shape and
allows the 3D video object to be seamlessly blended into new 2D or 3D video content. Appearance is typically
described by a series of video streams, comprising textures, surface light fields (i.e., view-dependent textures), or
surface reflectance fields (i.e., illumination- and view-dependent textures). The 3D video object can be
composited into existing content, or it can be interactively viewed from different directions, or under different
illumination. 3D objects can be applied to broadcast, storage (e.g., DVD), and interactive online applications.

5. 3D Audio 

As the viewpoint of 3D video moves, the listening point and/or the sound source position also move. 3D
sound can be recorded in several ways.

First, multiple stereo microphones mounted with the cameras can capture multiview sound. The multiview sound can
simply be switched as the viewpoint moves, but this scheme needs a huge amount of memory for transmitting/storing all
sound objects.

Second, a 3D microphone can record sound from all directions. In this case, the listening point and
sound source position can be changed freely, but at a high computational load.

Third, object-based microphones can record 3D sound at their own positions with uni-directional microphones,
and the results can be coded individually. These sound objects need a sound scene composition tool such as
MPEG-4 advanced AudioBIFS to make a 3D sound scene. Each sound object needs a 3D positioning tool such as
HRTF rendering for mapping a monaural sound to 3D sound space. Thus 3D sound position description
tools are also needed.
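
Returning to the omni-directional representation (category 1 above): rendering a 2D view amounts to looking up, for each viewing direction, the corresponding sample on the non-planar image plane. A minimal sketch of that lookup follows. It assumes the sphere is stored as an equirectangular image; the storage layout, the angle conventions and the function name direction_to_equirect are illustrative assumptions and are not prescribed by this document.

```python
import math


def direction_to_equirect(yaw, pitch, width, height):
    """Map a viewing direction to (column, row) in an equirectangular panorama.

    Assumed conventions (not taken from the document): yaw lies in [-pi, pi)
    with 0 at the image centre, pitch lies in [-pi/2, pi/2] with +pi/2 at
    the top row.
    """
    u = (yaw + math.pi) / (2.0 * math.pi)   # 0..1 across the full 360 degrees
    v = (math.pi / 2.0 - pitch) / math.pi   # 0..1 from top row to bottom row
    col = min(int(u * width), width - 1)    # clamp the right-hand edge
    row = min(int(v * height), height - 1)  # clamp the bottom edge
    return col, row
```

A renderer would call such a function once per output pixel, after converting the pixel position of the desired 2D view into a yaw/pitch pair.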




Fig. 1. Types of spatial camera configuration for multiview video (panels: Parallel View, Convergent View, Divergent View)

Figure 2 shows a possible classification of 3D video, using acquisition, representation and display as criteria. 
The examples below show some concrete instantiations of this concept.




Fig. 2. Classification of 3D video

Acquisition: Omni View Video Camera
• # of Cameras = 1
• Spatial relation: Divergent
• Screen Geometry: Non-Planar
Representation: Raw Video (1 Video Stream)
Display: Mono-Scopic Display

Example 4: Free Viewpoint Video (I)

Acquisition: Video Camera
• # of Cameras = 16
• Spatial relation: Convergent
• Screen Geometry: Planar
Representation: Processed Video (Image Based Rendering Model)
Display: Mono-Scopic Display

Free Viewpoint Video (II)

Acquisition: 2D Video Camera
• # of Cameras = 13
• Spatial relation: Convergent
• Screen Geometry: Planar
Three different kinds of interactivity with the video can be distinguished:

1. Interaction at the encoder side

In this case, the end user selects the viewpoint by remotely interacting with the encoder side. Also, to reduce
the required channel bandwidth, the display type information can be used in display-dependent coding.
It can be applied to 1:1 AV delivery models such as video communications, VoD, webcasting, etc. In case of
VoD and webcasting, in order to serve multiple users simultaneously, the sender should have multiple encoders
in case of real-time encoding, or have different compressed files depending upon viewpoints and display types in
case of non-real-time encoding. The backchannel information can also be used in the multiplexing process to
format proper bitstreams for the video data required by users.

2. Interaction with all data available at the decoder side

In this case the end user has all video and additional data available and can navigate freely within the scene.
This is practical for storage (e.g. DVD) applications and all kinds of interactive video. Practical
broadcast applications of omni-directional (panoramic) and interactive stereo video are possible. Broadcast of
interactive multiple view video might be impractical due to the huge amount of data (which increases with N).

3. Interaction without all data available at the decoder side

Here the end user does not have all video and additional data available. In this case, the video data of all
viewpoints are compressed at the encoder side, but only the bitstreams of video for the user's requested
viewpoint(s) and display types are transmitted to the decoder side. Hence, free navigation within the scene requires a
backchannel, which is impractical for broadcast applications. This is also out of scope for storage (e.g. DVD)
applications. However, such an approach might be appropriate for streaming (e.g. Internet, Client/Server)
applications to avoid initial download of all data.
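
The third kind of interaction can be pictured as a lookup at the sender: all views are already compressed, and the backchannel request decides which bitstreams are actually sent. The sketch below only illustrates that idea. The dictionary layout, the key format (view index, display type) and the function name select_bitstreams are hypothetical and are not defined by 3DAV.

```python
def select_bitstreams(encoded_views, requested_views, display_type):
    """Return only the bitstreams that match a backchannel request.

    encoded_views maps (view_index, display_type) to a compressed
    bitstream (bytes). This layout is an assumption of the sketch.
    """
    selected = {}
    for view in requested_views:
        key = (view, display_type)
        if key in encoded_views:          # silently skip unavailable views
            selected[view] = encoded_views[key]
    return selected
```

For example, a stereoscopic client that requests views 3 and 4 would receive two bitstreams, while the remaining N - 2 views never leave the server.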

1.1 3D-Video System Architecture

Fig. 3 shows the block diagram of the 3D-Video system architecture. Coloured blocks indicate parts that have
to be considered in the standardization process. The blocks have the following functionalities:

1. Video Capturing with Camera Parameters

Captures images and outputs raw video data and associated camera parameters, which might include depth
data.

2. Format Conversion

Converts raw data to uncompressed format.

3. 3D Based Processing

Performs application specific processing functions and outputs such application specific data in the
uncompressed format. Some examples of application specific processing are listed below:

i) Integration with computer graphics

ii) Construction of 3D models and texture data

iii) Processing to support view dependent coding

iv) Processing to support display dependent coding

v) Extraction of depth, disparity data

4. Encoding

Transforms uncompressed format data into compressed format.

5. Delivery

Delivery of compressed data via media such as broadcasting, network, DVD, etc.

6. 3D-Video Decoding

Decoding of the compressed data and output of uncompressed format data.

7. Rendering

Renders the video on the screen.

8. Gets the user's request to change the view point and view direction and transfers them to the rendering part or the 3D
based processing part through the backchannel.
Fig. 3. Block diagram of 3D video system architecture



1.2 Standardization items 

This section gives a high-level overview of what needs to be standardized for the different categories of
representations (without claiming to be exhaustive, complete or consistent).

1. Omni-directional (panoramic) video

• Panoramic video signal itself 

• Non-planar image plane (sphere, cylinder, other) 

• Mapping to 2D image plane 

2. Interactive stereo video 

• (At least) one video signal 

• (3D) Shape and object information 

• Camera calibration information (might not be mandatory) 

• (Accurate) depth information (e.g. disparity data), possibly represented in new ways (might not be mandatory)

• Additional data, hidden layers (might not be mandatory)

3. Interactive multiple view video

• N video signals

• (3D) Shape and object information

• Camera calibration information (might not be mandatory)

• (Accurate) depth information (e.g. disparity data), possibly represented in new ways (might not be mandatory)

• Additional data, hidden layers (might not be mandatory)

4. Interactive 3D Audio

• N audio signals 

• Natural 3D audio itself 

• (3D) Microphone arrangement configuration (might not be mandatory) 

• (3D) Audio scene description/composition 

• Listening environment description 

• Geometrical co-location of 3D audio and video

• Additional data (might not be mandatory) 

1.3 Relation with available standards and ongoing activities

It has been identified that several tools already available in different MPEG standards, or under
investigation, are relevant for 3DAV. This section gives an overview of these tools and an analysis of their
potential suitability for the 3DAV framework. However, this needs further investigation, including technical
evaluation and experimentation.

1.3.1 Overview 

1. Omni-directional (panoramic) video

This could be handled by 3D BIFS by mapping a video onto the inner surface of a sphere or cylinder.
However, conventional coding of panoramic video might not be efficient; special coding schemes taking the
spherical projection into account should be investigated. A special solution (without 3D BIFS) would be much more compact
and easier (cheaper) to handle/implement.

2. Interactive stereoscopic video

A multi-view profile is available in MPEG-2, but it does not support interactivity at all. The Multiple
Auxiliary Components (MAC) from MPEG-4 can be used for disparity data; representation with more than 8 bits and
lossless coding are possible. But MAC might not be efficient due to the special nature of disparity data. New
methods for representation and coding of depth data should be investigated, e.g. layered depth images. Some of this
technology is already under investigation in SNHC. Certain kinds of camera calibration information are
available in MPEG-4 and MPEG-7; however, these do not satisfy the needs of interactive video in terms of
functionality and accuracy. New descriptors in MPEG-7, Sensor Parameters and 3D Coordinates, might be appropriate for
interactive video; however, these are under investigation in CE status, and the CE might be stopped due to lack of
activity. Hidden layers could be coded as arbitrarily shaped MPEG-4 VOPs; however, due to the nature of the data
(irregularly shaped, many small pieces) this might not be efficient, and new approaches should be investigated. This
means that some useful tools are already available or under investigation in MPEG. Some new ones should be
developed to increase efficiency. The elements are spread over different standards and groups. They need to be
combined in a special framework for interactive stereo video.

3. Interactive multiple view video

In addition to what is stated for interactive stereo video, a framework needs to be developed to accommodate
the scene description with N calibrated cameras.

4. Interactive 3D Audio

MPEG-4 advanced AudioBIFS can create an interactive 3D sound scene. The Sound and DirectiveSound nodes
spatialize a sound object in 3D sound space.



1.3.2 Depth image-based representation (AFX)



The MPEG-4 Animation Framework extension (AFX) defines two image-based depth representations for
use in image-based rendering: a simple Depth Image (DI), which is just an image with an associated depth map,
as well as a multivalued image and depth representation, called a Layered Depth Image (LDI). A simple DI
allows novel views of a scene to be created by means of 3D warping of the DI pixels. In addition to this, an LDI allows
additional depth and color values to be stored for pixels that are occluded in the original view. This extra data
provides the information needed to fill disoccluded areas in the rendered, novel views.

Up to now AFX specifies a DepthImage structure, which consists of a computer-graphics centric camera
definition (i.e. position, orientation, field of view, near clipping plane, far clipping plane etc.) and a pointer to a
depth image. This can either be a SimpleTexture (i.e. a DI) or a PointTexture (i.e. an LDI). For a SimpleTexture
the texture and depth fields can be comprised of either an ImageTexture, a MovieTexture or a PixelTexture, as
defined in the MPEG-4 Video/System documents. A PointTexture is comprised of a) a texture which stores for
each pixel the number of layers as well as the color values for each layer; b) a depth map which stores for each
pixel the 4-byte depth values for each layer. In either case, the depth values should be normalised to the distance
from the near to the far clipping plane of the camera.

The authors explicitly note that the format could also be used for animated objects (i.e. sequences) by storing 
sets of compressed video streams instead of images, together with 'streams' of depth maps. For compression, 
they simply state that still and video coding formats of MPEG-4 should be used for textures and depth maps. 
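
To make the normalisation and the layered structure described above more concrete, the following sketch shows one possible reading. It assumes a linear mapping between the near and far clipping planes and models a layered depth pixel as a plain list of (depth, colour) pairs. Both choices, and all function names, are illustrative assumptions rather than the AFX definition.

```python
def normalise_depth(z, near, far):
    """Map a distance z between the clipping planes to the range [0, 1].

    A linear mapping is assumed here; values outside [near, far] are clamped.
    """
    if not near < far:
        raise ValueError("near must be smaller than far")
    z = min(max(z, near), far)
    return (z - near) / (far - near)


def denormalise_depth(d, near, far):
    """Inverse of normalise_depth for d in [0, 1]."""
    return near + d * (far - near)


def add_layer(ldi_pixel, depth, colour):
    """Insert a (depth, colour) sample into a layered depth pixel.

    The pixel is kept sorted front to back, so that occluded samples
    follow the visible one.
    """
    ldi_pixel.append((depth, colour))
    ldi_pixel.sort(key=lambda layer: layer[0])
```

A simple DI corresponds to the case where every pixel holds exactly one layer; an LDI pixel may hold several, which is what allows disoccluded areas to be filled after warping.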

Source:

MPEG-4 Animation Framework extension (AFX) VM 6.0 (N4626) 



Status: 

Under consideration 



Suggested reading:

L. McMillan. An Image-Based Approach to Three-Dimensional Computer Graphics. PhD thesis, University
of North Carolina at Chapel Hill, 1997.

M. Oliveira, G. Bishop, and D. McAllister. Relief Texture Mapping. In Proceedings of SIGGRAPH '00,
pages 259-268, New Orleans, LA, USA, July 2000.

J. Shade, S. Gortler, L.-W. He, and R. Szeliski. Layered Depth Images. In Proceedings of SIGGRAPH '98,
Orlando, FL, USA, July 1998.

N. L. Chang and A. Zakhor. Constructing a Multivalued Representation for View Synthesis. International
Journal of Computer Vision, 45(2):157-190, 2001.

C.-F. Chang, G. Bishop, and A. Lastra. LDI Tree: A Hierarchical Representation for Image-Based Rendering.
In Proceedings of SIGGRAPH '99, Los Angeles, CA, USA, August 1999.



The depth image-based representation, as currently under consideration within the MPEG-4 Animation
Framework extension (AFX), seems very much computer graphics oriented (see e.g. the camera definition) at
the current stage. Nevertheless it seems worthwhile to have a closer look at it, as both the simple Depth Image
(DI) and the Layered Depth Image (LDI) seem to fit very well into the 3DAV framework. A joint discussion
between the 3DAV and the AFX groups seems necessary and useful, as the proposed techniques still seem to be
at a very early stage of development. In particular, the AFX authors' statement that 'for compression, still and video
coding formats of MPEG-4 should be used for textures and depth maps' seems rather vague and certainly needs
more investigation and discussion.

1.3.3 Multiview profile (MPEG-2) 

The MPEG-2 multiview profile (MVP) was defined in 1996 as an amendment to the MPEG-2 standard, with
the main application area being stereoscopic TV. The MVP extends the well-known hybrid coding towards
exploitation of inter-viewchannel redundancies by implicitly defining disparity-compensated prediction. The main
new elements are the definition of usage of the temporal scalability (TS) mode for multi-camera sequences and
the definition of acquisition parameters in the MPEG-2 syntax. The TS mode was originally developed to allow
for the joint encoding of a low frame rate base layer stream and an enhancement layer stream comprised of
additional video frames. If both streams are available and decoded, video can be reproduced with full frame rate. In
the TS mode, temporal prediction of enhancement layer macroblocks can be performed either from a base
layer frame, or from the recently reconstructed enhancement layer frame.

For stereo or multichannel signals comprised of the video data captured simultaneously from two or more
views of the scene, it is straightforward to perform encoding using the TS syntax. For this purpose, frames from
one camera view are defined as the base layer, and frames from the other one(s) as enhancement layers. The
enhancement-from-base-layer prediction then turns out as a disparity-compensated prediction instead of a
motion-compensated prediction. If the disparity-compensated prediction fails, it is still possible to achieve
compression by motion-compensated prediction within the same channel. At the same time, the base layer represents a
monoscopic sequence.

Unfortunately, disparity vectors defined on a block-by-block basis of 16x16 pixels, as used in the TS mode of
MPEG-2, are not accurate enough to minimize the inter-viewchannel prediction error to the possible extent. It
can be observed that in many cases (with the exception of high motion) the similarity between subsequent frames
within one of the views is much higher than the similarity between the different views, such that the
motion-compensated interframe prediction is most likely preferred over the disparity-compensated inter-viewchannel
prediction. As a consequence, the temporal scalability concept can only be marginally superior to a separate
encoding (so-called simulcast) of the channels, both concepts requiring approximately doubled rate as compared
to encoding a signal from a single camera.
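
The choice between the two predictors described above can be illustrated with a toy mode decision. The sketch assumes that the best temporal candidate block and the best inter-view candidate block have already been found by some search, and it compares them using the sum of absolute differences (SAD). The SAD criterion and the function names are assumptions of this illustration, not taken from the MPEG-2 text, which leaves encoder decisions open.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def choose_prediction(current, temporal_ref, interview_ref):
    """Pick the cheaper predictor for one enhancement-layer block.

    temporal_ref is the candidate from the same view (motion compensation),
    interview_ref the candidate from the base-layer view (disparity
    compensation). Ties go to motion compensation.
    """
    cost_motion = sad(current, temporal_ref)
    cost_disparity = sad(current, interview_ref)
    if cost_motion <= cost_disparity:
        return "motion", cost_motion
    return "disparity", cost_disparity
```

When most blocks come out as "motion", as the paragraph above suggests is typical, the inter-view reference contributes little and the scheme behaves much like simulcast.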

Source: 

Final Text of 13818-2/AMD 3 (MPEG-2 Multiview profile) (N1366)

Status: 

Already standardized

Suggested reading:

J.-R. Ohm. Stereo/Multiview Encoding Using the MPEG Family of Standards. Invited Paper, In Proceedings
of Electronic Imaging '99, San Diego, USA, January 1999.

The MPEG-2 multiview profile (MVP) was developed as a tool for the coding of stereoscopic and multiview
video sequences. As such it exploits inter-viewchannel redundancies by implicitly defining
disparity-compensated prediction. MVP was not developed for the coding of depth maps or disparity information and
therefore does not seem to be very useful for the current 3DAV activities, where it is intended to transmit the basic
video data together with associated depth or disparity information. The MPEG-2 multiview profile is not 3D
video, because no geometric calibration data is associated with the video and hence no compression based on the
3D geometric redundancy is employed. It is also not possible to realize any viewpoint modification, i.e.
interactivity is not supported at all.

In the MPEG-2 MVP, basically temporal scalability is used and thus each view is carried in a separate layer. In
this case synchronization between the views is achieved using timestamps. The multi-layer approach causes
few problems in the case of hardware decoders. However, view synchronization based on the timestamp
mechanism is very difficult to implement in the case of software players running on PC platforms.

1.3.4 Multiple auxiliary components (MPEG-4)

The basic idea of the Multiple Auxiliary Components (MAC) is that grayscale shape is not only used to
describe the transparency of the video object, but can be defined in a more general way. MACs are defined for a
video object plane (VOP) on a pixel-by-pixel basis, and contain data related to the video object, such as disparity,
depth, and additional texture. Up to three auxiliary components (including the grayscale or alpha shape) are
possible. Only a limited number of types and combinations are defined and identified by a 4-bit integer so far, but
more applications are possible by selection of a USER DEFINED type or by definition of new types. All the
auxiliary components can be encoded by the shape coding tools, i.e. the binary shape coding tool and the
grayscale shape coding tool, which employs a motion-compensated DCT, and usually have the same shape and
resolution as the texture of the video object.

Source: 

Text of ISO/IEC 14496-2 (MPEG-4 Visual) 2001 Edition (N4350)

Status:

Already standardized

Suggested reading:

J.-R. Ohm. Stereo/Multiview Encoding Using the MPEG Family of Standards. Invited Paper, In Proceedings
of Electronic Imaging '99, San Diego, USA, January 1999.

J.-R. Ohm and K. Müller. Core Experiments on Multiview Objects. Doc. number MM 78, February 1996.

R. Koenen. MPEG-4 Overview. Doc. number N4030, March 2001.
The MPEG-4 multiple auxiliary components (MAC) tool is a generalization of the grayscale shape coding. As
such there are a number of possible shortcomings when it is used to encode depth maps or disparity information.
First, the usage of MAC implies that the binary shape of the video object has to be transmitted. If there is none,
the shape of the rectangular screen has to be encoded, which is quite a waste of bits. Second, depth maps or
disparity information have very specific characteristics, and it is questionable whether the artifacts introduced through
DCT encoding and quantization are tolerable with respect to the quality of the synthesized views. Third, while it
seems useful to use the texture motion vectors to compensate the depth or disparity sequence, so that one does not
have to transmit a whole new set of motion vectors for the auxiliary sequence, it is questionable whether texture motion
fields and depth or disparity motion fields are indeed always as close as one might think (e.g. consider the
cases where texture changes due to illumination changes, while the scene geometry and thus the depth or
disparity basically stays the same). If this is not the case, using texture motion vectors to predict depth or disparity
images might lead to a very costly differential coding.

Despite these possible shortcomings, it seems useful to have a closer look at the MAC to see if they could be
extended to better fit the needs of the 3DAV activities.

While MAC can be used to synthesize object images as observed from intermediate viewpoints
between the cameras and to integrate them with 3D graphics data, it is not a full 3D representation, because no
explicit 3D geometric calibration data is associated. A disparity map is incomplete 3D information. It is
questionable whether such a method would provide sufficient compression efficiency.

1.3.5 Light Field Mapping (MPEG-4) 

At the Pattaya meeting, R. Grzeszczuk et al. (M7604) proposed Surface Light Field Mapping. It defines a coding
method for multiview object images associated with precise geometric calibration data. Instead of incomplete 3D
information such as disparity, it includes an explicit 3D object shape represented by triangular patches, based on
which textures on the patches can be generated depending on interactively specified viewpoints. Such an explicit
3D representation enables very high data compression ratios (100 - 3000 : 1). That is, while a large amount of
image data is captured, it has significant 3D spatial redundancy in addition to 2D surface (texture)
redundancy: each point on the object surface is recorded many times in the multiview images.

While the surface light field mapping method is effective for STATIC 3D objects, it cannot be applied to
moving objects, because the object shape and pose change dynamically.

1.3.6 Other tools to be investigated 

The following tools need further investigation regarding their relevance and suitability for 3DAV: 

• 3D-BIFS for video and audio 

• Camera geometry information based on SMPTE 315M

• Sensor Parameters and 3D Coordinates (proposed descriptors) 

• MPEG-7 Camera Motion descriptor 

2 Applications and products 
2.1 Available today 

The following list shows examples of applications available today on the market. Some of them are not
directly related to 3DAV; however, they provide some sense of what 3DAV is trying to realize.

1. Camera (+System)

FourthView (Sony) - System for PS2
http://www.sony.co.jp/Products/fourthview/

Zcam (3DV Systems) - Camera with Depth Data
http://www.3dvsystems.com/

Digiclops(TM) (Point Grey Research)
http://ptgrey.com

FlyCam (Fuji Xerox)
http://www.ubiquitous-media.com

Panorama Video Cam (Sharp)
http://www.sharp.co.jp/corporate/news/010919-2.html

EYE Vision (CBS, Mitsubishi Heavy Industry)
http://www.sdia.or.ro

CAM-4000 3D Camera (VREX)
http://www.vrex.com/products/cm4000.shtml

2. Display

BOOM 3C (Fakespace)
http://www.fakespacelabs.com/products/boom3c.html

Sanyo 3D Screen (Sanyo)
http://www.sanyo.co.jp/koho/hypertext4/0109news-j/0912-1.html

3D-LCD (Philips)
http://www.research.philips.com

3D TFT-LCD (Samsung)
http://www.sec.samsung.com/news/digital_media/

Cyberbook, Stereo 3D-notebook (VREX)
http://www.vrex.com/products/microp

21MX (NuVision)
http://www.nuvision3d.com/shutters.html

3. 3D glasses

StereoGraphics Corporation
http://www.stereographics.com/

Iart3d
http://www.iart3d.com/

Anotherworld
http://www.anotherworld.to/

i-O Display Systems
http://www.i-glasses.com/

VRJoy
http://www.vrstandard.com/

4. Panorama view software

QuickTime VR (Apple Computer) - divergent/convergent view still picture
http://www.apple.com

Be Here Technologies - divergent view movie
http://www.behere.com/

IPIX (Internet Pictures Corporation) - divergent view still picture
http://www.ipix.com/



2.2 Possible applications in the near future 



2.2.1 Multiple-view video 

Interactively changing viewpoint and view direction in 3D video data. Any existing 2D video application can
be replaced with 3D.

For instance, viewers could interactively change viewpoint and view direction among cameras mounted on players in
football games, or on cars or drivers on an F1 circuit. This may be more interesting in live
broadcast, and even more so in a multi-user communication environment.

Note: with a surrounding display (3D display), interactivity is not necessarily mandatory. (But anyway this is
one case subject to interactivity.)




Fig. 4. Recording of Noh play with multiple views

Possible application domains are: 



1. Entertainment 

• Concert 

• Sport 

• Disco 

• Multi-user Game 

• Movie 



2. Education 

• Cultural Archives 

• Manual with real video 

• Instruction of sports playing 

3. Medical surgery 

4. Viewing with Exploration

• Zoo, Aquarium, Botanical garden 

• Museum 

• Catalogue with real video (3D TV shopping) 

5. Communication

6. Sightseeing

7. Surveillance

2.2.2 Stereoscopic (interactive) video 




Fig. 5. Examples of stereoscopic video

Fig. 6. Recording scenery with stereo omni-directional video



1. Sports broadcast or webcasting

• Exhibition broadcasts were already made at the Olympic Games of Sydney and Nagano.

• ETRI plans to broadcast games of the 2002 World Cup Korea-Japan experimentally.

• In the case of a soccer game, viewers can feel as if they were really in the stadium.

2. Education

• Driving, flight simulation, anatomy, molecular structure

• Provides realism in remote education.

• Provides cost effectiveness in areas that need expensive materials and apparatus.

• In the case of driving education, beginners can practice without the risk of an accident.

• In the case of molecular structure study, it is easier to recognize perspective relations among molecules.

3. Entertainment

• Movie, game, home shopping, sightseeing

• There are many stereoscopic theaters in amusement parks.

• There are many game products supporting stereoscopic functionality on the commercial market.

• Viewers can enjoy a more realistic impression in movies, games, sightseeing, etc.

• In the case of home shopping, the stereoscopic impression gives consumers more visual information about products.

2.2.3 Broadcast 3D video 

Probably 3D-TV will be the next major revolution in the history of TV. Both at professional and consumer
electronics exhibitions, companies are eager to show their new 3D products, which always attract a lot of interest.
Obviously, if a workable and commercially acceptable solution can be found, the introduction of 3D-TV will
generate a huge replacement market for the current 2D-TV sets. In this decade, it can be expected that technology will
have progressed far enough to make a full 3D-TV application available on the mass consumer market, including
content generation, coding, transmission and display.

The ATTEST consortium works towards a flexible, 2D-compatible and commercially feasible 3D-TV system
for broadcast environments. The consortium consists of 8 European partners (e.g. Philips).

As the main depth cue comes from stereovision, it seems natural to record and distribute 3D broadcast
signals as separate video streams for each eye. Hence existing trials for the introduction of 3D-TV are based on this
idea of 'stereoscopic' video. This approach is merely broadcast production centered and restricted in some sense:
the director has full control over the depth effect; the viewer has to accept the effect as provided. Backwards
compatibility is not necessarily supported, and there is no possibility to change the viewing conditions, to scale
the grade of depth perception or to adapt to different kinds of mono, stereo or multi-view displays.

In contrast, to cope with the different aspects of compatibility, scalability and adaptability, the ATTEST
approach to 3D-TV is quite different from former ones. The core is a flexible and scalable syntax for image-based
3D data representation, which will be open for different display types and viewing conditions (see Fig. 7). The
purpose of this syntax is to introduce 3D video as a combination of regular video (2D base layer) and
synchronized image-based depth information (3D enhancement layers). Due to this structure our 3D-TV approach will
be backward compatible to existing 2D video services.

Fig. 7. The layered coding syntax provides backward compatibility to conventional 2D digital TV and allows
the view synthesis to be adapted to a wide range of different 2D and 3D displays.

It will further be possible to reconstruct and to control multiple virtual views, supporting a wide variety of 
tracked, mono, stereo and multi-view displays. Hence, the system will be scalable in terms of receiver 
complexity - an important issue to introduce 3D-TV in an evolutionary manner. Due to the usage of 3D enhancement 
layers the syntax will also provide scalability in terms of depth experience. This is particularly important 
because perception studies have indicated that there are differences in depth appreciation over age groups. Hence in 
our view, the viewer should be in control of his depth experience. He should be able to set the depth level 
according to his personal preference - a feature that can also be used as graceful degradation in the case of 
unexpected artifacts in depth, which are usually more annoying in stereovision than in parallax viewing. 
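
The viewer-controlled depth level can be illustrated with a minimal sketch. It is not the ATTEST syntax or renderer. It assumes, purely for illustration, one 8-bit depth sample per pixel where larger values mean nearer objects, a user setting `depth_gain` between 0 and 1, and a hypothetical maximum shift `max_shift_px`. One scanline of the 2D base layer is warped by a shift proportional to depth times gain.

```python
def render_row(colors, depths, depth_gain, max_shift_px=4, levels=255):
    """Warp one scanline of the 2D base layer into a virtual view.

    colors      -- pixel values of the base-layer scanline
    depths      -- per-pixel depth samples in [0, levels]; larger = nearer (assumption)
    depth_gain  -- viewer-selected depth level in [0, 1]; 0 gives the plain 2D row
    Returns the warped row; positions nobody maps to stay None (disocclusion holes).
    """
    if len(colors) != len(depths):
        raise ValueError("colors and depths must have the same length")
    if not 0.0 <= depth_gain <= 1.0:
        raise ValueError("depth_gain must lie in [0, 1]")

    out = [None] * len(colors)
    nearest = [-1] * len(colors)  # depth of the pixel currently stored at each target
    for x, (color, depth) in enumerate(zip(colors, depths)):
        shift = round(depth_gain * max_shift_px * depth / levels)
        target = x + shift
        # Keep the nearer pixel when two source pixels land on the same target.
        if 0 <= target < len(out) and depth > nearest[target]:
            out[target] = color
            nearest[target] = depth
    return out
```

With `depth_gain = 0` every shift is zero and the function returns the base-layer row unchanged, which corresponds to the graceful-degradation case; a real renderer would additionally fill the `None` holes.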

3DAV requirements 



Note: The requirements need to be clustered in a meaningful way, e.g. video, audio, systems, etc. 

1. Uncompressed representation format of 3D video 

3DAV shall define an uncompressed format for representation of 3D video content. 

2. Compressed format of 3D video 

3DAV shall provide a compressed format for exchanging 3D video content between different systems. 

3. Uncompressed representation format of 3D audio 

3DAV shall define an uncompressed format for representation of 3D audio content. 

4. Compressed format of 3D audio 

3DAV shall provide a compressed format for exchanging 3D audio content between different systems. 

5. Extrinsic and intrinsic camera parameters 

Camera parameters shall be described which contribute to the reconstruction of 3D views. This shall include 
extrinsic parameters such as 3D position and angle as well as intrinsic parameters (focus, aspect ratio, etc.). This 
shall enable full geometric calibration of the imaging system in 3D. Dynamic changes shall also be represented 
(e.g. camera rotation and translation, zoom). Those parameters may include: 

• Camera Models 

■ Geometric Information 

■ Photometric Information 

■ Temporal Information 

■ Screen definitions 

• Camera Types 

■ Pin-hole camera 

■ Thin lens camera 

■ Thick lens camera 

■ Fish-eye lens camera 

■ Camera with lens and mirror(s) 

• Camera Works 

■ Motion (rotation, translation, position) 

■ Zooming 

■ Focusing 

■ Iris and gain control 
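
How extrinsic and intrinsic parameters combine can be sketched for the simplest camera type in the list, the pin-hole camera. The function names, the world-to-camera convention x_cam = R (X - C), the focal length expressed in pixels and the principal point are assumptions made for this example and are not defined by the requirement.

```python
import math


def rotation_y(angle_rad):
    """3x3 rotation matrix for a pan about the vertical (y) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def project(point, rotation, position, focal_px, aspect_ratio, principal_point):
    """Project a 3D world point to pixel coordinates with a pin-hole model.

    Extrinsic parameters: rotation (3x3 matrix) and position (camera centre).
    Intrinsic parameters: focal_px, aspect_ratio, principal_point (all assumed names).
    """
    # Extrinsic step: express the point in camera coordinates.
    rel = [p - c for p, c in zip(point, position)]
    x, y, z = (sum(r * v for r, v in zip(row, rel)) for row in rotation)
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsic step: perspective division, then scale and offset to pixels.
    u = focal_px * x / z + principal_point[0]
    v = focal_px * aspect_ratio * y / z + principal_point[1]
    return u, v
```

Dynamic changes such as camera motion or zooming would correspond to time-varying `rotation`, `position` and `focal_px` values in this sketch.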

6. Non-planar imaging and display systems 

3DAV shall support efficient representation and coding methods for non-planar imaging and display systems. 
This shall include for instance cylindrical and spherical image planes. These methods should be designed taking into 
account that the video can be projected easily and efficiently onto non-planar screens. 
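
As an illustration of a spherical image plane, the following sketch maps a viewing direction to pixel coordinates in a longitude/latitude (equirectangular) panorama. The choice of this particular mapping, the axis convention (+z forward, +y up) and the function name are assumptions for the example; the requirement itself does not prescribe a projection.

```python
import math


def direction_to_panorama(direction, width, height):
    """Map a 3D viewing direction to (u, v) in an equirectangular panorama.

    direction -- (x, y, z) vector, not necessarily normalized; +z forward, +y up
    width, height -- panorama size in pixels (360 x 180 degrees of coverage)
    """
    x, y, z = direction
    if x == 0 and y == 0 and z == 0:
        raise ValueError("direction must be non-zero")
    longitude = math.atan2(x, z)                 # in [-pi, pi], 0 = straight ahead
    latitude = math.atan2(y, math.hypot(x, z))   # in [-pi/2, pi/2], 0 = horizon
    u = (longitude / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - latitude / math.pi) * height      # v = 0 is the top row (zenith)
    return u, v
```

Rendering a conventional 2D view in any direction then amounts to evaluating this mapping for the ray through each output pixel and sampling the panorama there.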

7. Multiple Views 

Multiple views of a scene shall be described. This shall include stereoscopic views, i.e. views with associated 
depth or disparity data. 

8. Synchronization 

Accurate temporal synchronization between the multiple views shall be supported. 

9. Reuse of existing tools 

Wherever possible existing MPEG tools shall be reused: ...... 

10. Integration with SNHC computer graphics objects 

The 3DAV framework shall allow integration of existing SNHC computer graphics objects. 

11. Interactivity 

Interactive change of viewpoint and view angle shall be supported. This shall include local interactivity at the 
decoder and remote interactivity between decoder and encoder. The latter requires a backchannel. 



12. Disparity, depth information 

Efficient representation and coding of depth maps (e.g. disparity data) should be supported, enabling lossless 
reconstruction, highly accurate depth representation and efficient compression at the same time. Such data are 
used to reconstruct a stereo-pair VOP. 
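
The relation between depth and disparity can be sketched for the simplest case. The example assumes a parallel (rectified) stereo pair, for which disparity equals focal length times baseline divided by depth; the function and parameter names are illustrative only, and convergent or divergent camera settings need a more general model.

```python
def disparity_from_depth(depth_m, focal_px, baseline_m):
    """Horizontal disparity in pixels for a point at depth_m (parallel cameras)."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Inverse relation: depth in metres recovered from a disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

Because disparity falls off with the inverse of depth, a fixed quantization step in disparity corresponds to coarser depth resolution for distant objects, which is one reason accurate representation matters here.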

13. 3D objects 

3D objects shall be supported for handling several stereoscopic or multiview objects in a scene. 

14. Backwards compatibility 

Backwards compatibility with MPEG-2 video should be supported, since 2D and 3D broadcast will co-exist 
while introducing 3D-TV. 

15. Compression efficiency 

3DAV shall provide high compression efficiency for a wide range of applications. This shall include broadcast 
as well as mobile communication scenarios. The overhead caused by additional 3D data should be limited (to e.g. 
20%), in order to increase acceptance of the new services. 
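
A short worked check of the overhead limit, assuming that the overhead is measured as the bit rate of the additional 3D data relative to the bit rate of the 2D video; the function names and the bit-rate figures in the usage note are hypothetical.

```python
def overhead_fraction(base_kbps, enhancement_kbps):
    """Bit rate of the additional 3D data as a fraction of the 2D base bit rate."""
    if base_kbps <= 0:
        raise ValueError("base bit rate must be positive")
    return enhancement_kbps / base_kbps


def within_overhead_limit(base_kbps, enhancement_kbps, limit=0.20):
    """True if the 3D overhead stays at or below the given limit (20% by default)."""
    return overhead_fraction(base_kbps, enhancement_kbps) <= limit
```

For example, with a 4000 kbit/s 2D stream, 600 kbit/s of depth data gives an overhead of 15% and passes the check, whereas 1000 kbit/s (25%) does not.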

16. Performance efficiency 

3DAV shall be efficient in terms of computational complexity. 

17. Occlusion handling 

Efficient representation and coding of multi-label masks and hidden layers for occlusion handling in interactive 
stereo video should be supported. 

18. Different display types 

Different 3D displays should be supported. This shall include conventional 2D displays, field- and frame-based 
shuttering displays, conventional stereo vision (non-tracked, e.g. using glasses), head motion parallax 
viewing on 2D displays, single-user viewing including head-motion parallax (head-mounted displays, 
auto-stereoscopic displays), multi-user auto-stereoscopic viewing on large screens. This shall also enable the user to 
change the display interactively. 

19. Scalability 

Complexity (i.e. cost) scalability of the end user terminals shall be supported. 



